Dual Averaging on Compactly-Supported Distributions
And Application to No-Regret Learning on a Continuum

Walid Krichene 


Abstract 

We consider an online learning problem on a continuum. A decision maker is given a compact
feasible set $S$, and is faced with the following sequential problem: at iteration $t$, the decision maker
chooses a distribution $x^{(t)} \in \Delta(S)$, then a loss function $\ell^{(t)} : S \to \mathbb{R}_+$ is revealed, and the decision maker
incurs expected loss $\langle \ell^{(t)}, x^{(t)} \rangle = \mathbb{E}_{s \sim x^{(t)}} \ell^{(t)}(s)$. We view the problem as an online convex optimization
problem on the space $\Delta(S)$ of Lebesgue-continuous distributions on $S$. We prove a general regret bound
for the Dual Averaging method on $L^2(S)$, then prove that dual averaging with $\omega$-potentials (a class of
strongly convex regularizers) achieves sublinear regret when $S$ is uniformly fat (a condition weaker than
convexity).


1 Introduction 

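We consider an online learning problem on a compact subset $S \subset \mathbb{R}^n$. At each iteration $t \in \mathbb{N}$, a decision
maker chooses a distribution $x^{(t)}$ on $S$, then a loss function $\ell^{(t)} : S \to \mathbb{R}_+$ is revealed, and the decision
maker incurs loss $\langle \ell^{(t)}, x^{(t)} \rangle = \mathbb{E}_{s \sim x^{(t)}}[\ell^{(t)}(s)]$. This is summarized in Problem 1.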


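Problem 1 Online decision problem with Lipschitz losses on $S$.

1: for $t \in \mathbb{N}$ do
2:   Decision maker chooses distribution $x^{(t)}$ over $S$.
3:   A loss function $\ell^{(t)} : S \to \mathbb{R}_+$ is revealed. We assume $\ell^{(t)}$ is $L$-Lipschitz.
4:   The decision maker incurs expected loss $\langle \ell^{(t)}, x^{(t)} \rangle = \mathbb{E}_{s \sim x^{(t)}}[\ell^{(t)}(s)]$.
5: end for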


The regret of the decision maker is defined as follows: for a given sequence of losses $(\ell^{(t)})$ and a
corresponding sequence of decisions $(x^{(t)})$, the cumulative regret at time $t$, denoted by $R^{(t)}$,

$$R^{(t)} = \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle - \inf_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s),$$

compares the expected loss accumulated by the decision maker to the infimum of the cumulative loss function.
In particular, we seek to design algorithms for which the regret grows sub-linearly in $t$, for any sequence of
losses in a given class (the assumptions on the losses will later be made explicit).
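The regret defined above can be evaluated numerically once $S$ is replaced by a finite grid. The following is a minimal sketch under that assumption, not part of the original analysis: the function name `cumulative_regret`, the one-dimensional grid, and the representation of each decision as a vector of probability masses on grid points are all illustrative choices.

```python
from typing import Callable, Sequence


def cumulative_regret(
    losses: Sequence[Callable[[float], float]],
    decisions: Sequence[Sequence[float]],
    grid: Sequence[float],
) -> float:
    """Regret of discretized decisions against the best fixed grid point.

    decisions[t][i] is the probability mass placed on grid[i] at round t
    (an assumed discretization of the distribution x^(t)).
    """
    incurred = 0.0                      # sum of expected losses <l^(tau), x^(tau)>
    cumulative = [0.0] * len(grid)      # cumulative loss at each grid point
    for loss, weights in zip(losses, decisions):
        values = [loss(s) for s in grid]
        incurred += sum(w * v for w, v in zip(weights, values))
        cumulative = [c + v for c, v in zip(cumulative, values)]
    # The infimum over S is approximated by a minimum over the grid.
    return incurred - min(cumulative)
```

Because the minimum is taken over the grid rather than over all of $S$, the returned value can underestimate the true regret by at most $t L$ times the grid spacing when the losses are $L$-Lipschitz.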

This sequential decision problem has a long history which dates back to Hannan and Blackwell [S], who
formulated the problem in the context of repeated games. The notion of regret is closely related to the notion
of consistent play (as defined by Hannan) and approachability (as defined by Blackwell). Beyond player
dynamics in repeated games, online learning has many applications such as portfolio optimization [HE]
and machine learning [12].

Regret minimization is essential in the design and analysis of online learning algorithms, and in the
study of player dynamics in repeated games. In this article, we study the problem of designing
sublinear regret algorithms under minimal assumptions on the feasible set $S$ and the sequence of losses $(\ell^{(t)})$.






Assumptions on $\ell^{(t)}$ | Assumptions on $S$ | Method | Learning rates | $R^{(t)}/t$
convex | convex | Gradient descent [21] | $t^{-1/2}$ | $O(t^{-1/2})$
$\alpha$-exp-concave | convex | Hedge [17] | $\alpha$ | $O(t^{-1} \log t)$
uniformly $L$-Lipschitz | $v$-uniformly fat | Dual Averaging with strongly convex $f$-divergence s.t. $f(x) = O(x^{1+\epsilon})$ (Section 3) | $t^{-\frac{1}{2+n\epsilon}}$ | $O(t^{-\frac{1}{2+n\epsilon}})$

Table 1: Regret upper bounds for different classes of losses.

When the feasible set $S$ is finite, and the losses are uniformly bounded, the Hedge algorithm [3],
also known as the multiplicative weight updates or the exponentiated gradient method, is known to
achieve sublinear regret, and is easy to analyze and to implement. More general classes of algorithms with
sublinear regret have been developed since. For example, the online mirror descent algorithm [8], an
extension of the mirror descent method due to Nemirovski and Yudin [20], is shown to have sublinear regret
for any choice of strongly convex distance-generating function. Similarly, the dual averaging method [21] is
shown to achieve sublinear regret for any choice of strongly convex regularizer. Remarkably, both of these
families of algorithms include the Hedge algorithm as a special case.

When the set $S$ is infinite, designing sublinear regret algorithms requires making additional assumptions
on the class of loss functions as well as the feasible set $S$. In [21], Zinkevich considers an online
problem on a convex $S$, for convex loss functions $\ell^{(t)}$. He shows that a simple gradient descent algorithm
is guaranteed to have regret which grows as $O(\sqrt{t})$. In [17], Hazan et al. also study the online learning
problem on convex $S$, and show that for some classes of loss functions, one can achieve logarithmic regret,
i.e. $R^{(t)} = O(\log t)$. In particular, they show that logarithmic regret is achieved by the Newton method when
the losses are $\alpha$-strongly convex, and by the Hedge algorithm when the losses are $\alpha$-exp-concave (uniformly
in $t$).

In this article, we design sublinear regret algorithms under mild assumptions on the feasible set and
the sequence of losses. In particular, we only assume that the losses are Lipschitz-continuous, and relax the
convexity assumption on the set $S$. Our main result is summarized in Table 1, together with regret bounds for
other classes of loss functions. We show that one can formulate the online learning problem as an optimization
problem over a convex subset of $L^2(S)$, allowing us to use results from (infinite dimensional) convex analysis.
By applying the dual averaging method of Nesterov to $L^2(S)$ we prove, in Section 2, a general regret bound
which holds for any choice of regularizer. In Section 3, we consider a particular class of regularizers, which
can be expressed as Csiszár divergences of $\omega$-potentials, and we derive sufficient conditions on the potential
to (i) make the dual averaging solution efficiently computable, and (ii) guarantee that the regret grows
sublinearly on any sequence of uniformly Lipschitz losses. This results in a general class of algorithms which
are efficient to implement and which have sublinear regret guarantees under mild assumptions on the feasible
set and the class of losses. In Section 5, we give concluding remarks, connections with related problems, and
directions for future work.


2 Dual averaging on $L^2(S)$

We start by applying Nesterov's dual averaging method [H] to our sequential decision problem, viewed as
an online optimization problem on a convex subset of $L^2(S)$, and derive a general regret bound for this
algorithm.

2.1 Dual Averaging on a Hilbert space 

Consider a Hilbert space $E$, and a feasible set $X \subset E$, assumed to be closed and convex, and let $\|\cdot\|$ be a
reference norm on $E$ (not necessarily the norm induced by the inner product).












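Let $\psi : X \to \mathbb{R}_+$ be proper, continuous, and Fréchet-differentiable on the interior of $X$, denoted by $\mathring{X}$.
The Bregman divergence associated to $\psi$ is defined as follows:

$$D_\psi : X \times \mathring{X} \to \mathbb{R}_+, \qquad (x, y) \mapsto D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle.$$

The function $\psi$ is said to be $\ell_\psi$-strongly convex with respect to a reference norm $\|\cdot\|$ if for all $(x, y) \in X \times \mathring{X}$,

$$D_\psi(x, y) \geq \frac{\ell_\psi}{2} \|x - y\|^2.$$

It is $L_\psi$-smooth with respect to $\|\cdot\|$ if for all $(x, y) \in X \times \mathring{X}$,

$$D_\psi(x, y) \leq \frac{L_\psi}{2} \|x - y\|^2.$$

As we describe below, strong convexity and smoothness are dual properties. Define the Fenchel-Legendre
conjugate of $\psi$,

$$\psi^*(y) = -\inf_{x \in X} \left( \psi(x) - \langle y, x \rangle \right).$$

Note that the minimum is attained and the minimizer is unique since $\psi$ is strongly convex and $X$ is closed
and convex (Theorem 11.9 in [5]). The gradient of $\psi^*$ is

$$\nabla\psi^*(y) = \arg\min_{x \in X} \left( \psi(x) - \langle y, x \rangle \right),$$

which we will refer to as the Bregman projection onto $X$, since it can be written as

$$\nabla\psi^*(y) = \arg\min_{x \in X} \left( \psi(x) - \langle \nabla\psi((\nabla\psi)^{-1}(y)), x \rangle \right) = \arg\min_{x \in X} D_\psi(x, (\nabla\psi)^{-1}(y)).$$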


Proposition 1. If $\psi$ is $\ell_\psi$-strongly convex with respect to $\|\cdot\|$, then $\psi^*$ is $\frac{1}{\ell_\psi}$-smooth with respect to the dual
norm $\|\cdot\|_*$.

Proposition 1 is an extension of Theorem 18.15 in [3] to general norms; the proof is provided in the
Appendix.

Given a sequence of linear functionals $(\ell^{(t)})$ in the dual space $E^*$, the method projects, at each step,
the cumulative dual vector $L^{(t)} = \sum_{\tau=1}^{t} \ell^{(\tau)}$, scaled by a step size $\eta_{t+1}$, onto the feasible set, using the
Bregman projection $\nabla\psi^*$. This is summarized in Algorithm 2. Without loss of generality, we will assume
that $\inf_{x \in X} \psi(x) = 0$.


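Algorithm 2 Dual averaging method with input sequence $(\ell^{(t)})$ and learning rates $(\eta_t)$

1: for $t \in \mathbb{N}$ do
2:   Define $L^{(t)} = \sum_{\tau=1}^{t} \ell^{(\tau)}$
3:   Update

$$x^{(t+1)} = \nabla\psi^*(-\eta_{t+1} L^{(t)}) = \arg\min_{x \in X} \left( \langle L^{(t)}, x \rangle + \frac{1}{\eta_{t+1}} \psi(x) \right) \qquad (1)$$

4: end for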
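To make the update (1) concrete, here is a minimal finite-dimensional sketch. It is an illustration under explicit assumptions rather than the setting studied in this article: the feasible set is taken to be the box $[-r, r]^d$ in $\mathbb{R}^d$, and the regularizer is $\psi(x) = \frac{1}{2}\|x\|_2^2$ (so that $\inf_X \psi = 0$). In that case the minimizer of $\langle L^{(t)}, x \rangle + \frac{1}{\eta_{t+1}} \psi(x)$ over the box is the Euclidean projection of $-\eta_{t+1} L^{(t)}$, which is a coordinate-wise clip. The function name `dual_averaging_box` and the callable `rates` are hypothetical.

```python
from typing import Callable, List, Sequence


def dual_averaging_box(
    loss_vectors: Sequence[Sequence[float]],
    rates: Callable[[int], float],
    radius: float = 1.0,
) -> List[List[float]]:
    """Dual averaging with psi(x) = ||x||^2 / 2 on the box [-radius, radius]^d.

    rates(t) returns the learning rate eta_t. Returns the iterates
    x^(1), ..., x^(T+1), where x^(1) minimizes psi (here, the origin).
    """
    dim = len(loss_vectors[0])
    cumulative = [0.0] * dim            # L^(t), the cumulative dual vector
    iterates = [[0.0] * dim]            # x^(1) = argmin psi = 0
    for t, ell in enumerate(loss_vectors, start=1):
        cumulative = [c + g for c, g in zip(cumulative, ell)]
        eta = rates(t + 1)
        # Bregman projection for the Euclidean regularizer: clip -eta * L^(t).
        iterates.append([max(-radius, min(radius, -eta * c)) for c in cumulative])
    return iterates
```

Only the cumulative vector $L^{(t)}$ is stored between rounds, which is the characteristic feature of dual averaging compared with methods that update the previous primal iterate.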







2.2 Dual Averaging on $L^2(S)$

In particular, we consider the case where $E = L^2(S)$, the Lebesgue space of square integrable functions on
$S$, endowed with the inner product $\langle f, g \rangle = \int_S f(s) g(s) \lambda(ds)$, where $\lambda$ is the scaled Lebesgue measure such
that $\lambda(S) = 1$. Let the feasible set $X$ be

$$X = \left\{ f \in L^2(S) : f \geq 0 \text{ a.e. and } \int_S f(s) \lambda(ds) = 1 \right\}.$$

Note that while $X$ is closed and convex, it is unbounded: if $A$ is a measurable subset of $S$, then $\frac{1}{\lambda(A)} 1_A \in X$,
and $\left\| \frac{1}{\lambda(A)} 1_A \right\|_2 = \lambda(A)^{-1/2}$, which can be arbitrarily large.

An element $f \in X$ will be identified with the probability distribution on $S$ with density $f$. The
dual space is $E^* = L^2(S)$, and since $S$ is compact, $E^*$ contains, in particular, the set $C^0(S)$ of continuous
functions on $S$. Problem 1 can be viewed as follows: at each iteration $t$, the decision maker chooses an
element $x^{(t)}$ of $X$, then an element $\ell^{(t)} \in C^0(S) \subset L^2(S)$ is revealed, and the decision maker incurs the expected
loss $\langle \ell^{(t)}, x^{(t)} \rangle$. Next, we define the regret and provide a first bound on the regret of the dual averaging
method.


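Definition 1. Let $(\ell^{(t)})$ be a sequence of elements of $L^2(S)$, and consider the dual averaging algorithm on
this sequence, with learning rates $(\eta_t)$. The cumulative regret of the algorithm is defined as

$$R^{(t)} = \sup_{x \in X} \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} - x \rangle = \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle - \inf_{x \in X} \left\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \right\rangle.$$

The regret is said to be sublinear if $\limsup_{t \to \infty} R^{(t)}/t \leq 0$.

The regret compares the cumulative loss of the algorithm, $\sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle$, to the best cumulative loss
$\inf_{x \in X} \langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \rangle$ of any stationary distribution $x$ (note that the infimum may not be attained).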

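Lemma 1 (Dual Averaging Regret). Consider the dual averaging method with dual sequence $(\ell^{(t)})$ and
learning rates $(\eta_t)$. Suppose that $\psi$ is $\ell_\psi$-strongly convex w.r.t. $\|\cdot\|$, and that the losses are bounded in the
dual norm, uniformly in $t$, i.e. there exists $M > 0$ such that for all $t$, $\|\ell^{(t)}\|_* \leq M$. Then for all $t$ and all
$x \in X$,

$$\sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} - x \rangle \leq \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2\ell_\psi} \sum_{\tau=1}^{t} \eta_\tau.$$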

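Proof. Define the potential function

$$\xi(\eta, L) = -\frac{1}{\eta} \psi^*(-\eta L).$$

We first show the following inequality: for all $\tau \geq 1$,

$$\langle \ell^{(\tau)}, x^{(\tau)} \rangle \leq \xi(\eta_\tau, L^{(\tau)}) - \xi(\eta_{\tau-1}, L^{(\tau-1)}) + \frac{\eta_\tau}{2\ell_\psi} \|\ell^{(\tau)}\|_*^2. \qquad (2)$$

Since $\psi$ is $\ell_\psi$-strongly convex w.r.t. $\|\cdot\|$, by Proposition 1, $\psi^*$ is $\frac{1}{\ell_\psi}$-smooth w.r.t. $\|\cdot\|_*$, therefore
$D_{\psi^*}(-\eta_\tau L^{(\tau)}, -\eta_\tau L^{(\tau-1)}) \leq \frac{1}{2\ell_\psi} \|\eta_\tau L^{(\tau)} - \eta_\tau L^{(\tau-1)}\|_*^2$, i.e., using $\nabla\psi^*(-\eta_\tau L^{(\tau-1)}) = x^{(\tau)}$,

$$\psi^*(-\eta_\tau L^{(\tau)}) - \psi^*(-\eta_\tau L^{(\tau-1)}) + \eta_\tau \langle x^{(\tau)}, \ell^{(\tau)} \rangle \leq \frac{\eta_\tau^2}{2\ell_\psi} \|\ell^{(\tau)}\|_*^2.$$

Thus

$$\langle \ell^{(\tau)}, x^{(\tau)} \rangle \leq \xi(\eta_\tau, L^{(\tau)}) - \xi(\eta_\tau, L^{(\tau-1)}) + \frac{\eta_\tau}{2\ell_\psi} \|\ell^{(\tau)}\|_*^2.$$

It remains to show that $\eta \mapsto \xi(\eta, L)$ is decreasing. Taking the derivative with respect to $\eta$,

$$\partial_\eta \xi(\eta, L) = \frac{1}{\eta^2} \psi^*(-\eta L) - \frac{1}{\eta} \langle \nabla\psi^*(-\eta L), -L \rangle$$
$$= \frac{1}{\eta^2} \left( \psi^*(-\eta L) + \langle \nabla\psi^*(-\eta L), \eta L \rangle \right)$$
$$\leq \frac{1}{\eta^2} \psi^*(0) \quad \text{by convexity of } \psi^*$$
$$= -\frac{1}{\eta^2} \inf_{x \in X} \psi(x) = 0,$$

which proves inequality (2) whenever the learning rates are nonincreasing, so that $\xi(\eta_\tau, L^{(\tau-1)}) \geq \xi(\eta_{\tau-1}, L^{(\tau-1)})$. Summing, and using the bound on $\|\ell^{(\tau)}\|_*$, we have

$$\sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle \leq \xi(\eta_t, L^{(t)}) - \xi(\eta_0, L^{(0)}) + \frac{M^2}{2\ell_\psi} \sum_{\tau=1}^{t} \eta_\tau.$$

By definition of $\xi$, we have

$$\xi(\eta_0, L^{(0)}) = -\frac{1}{\eta_0} \psi^*(0) = \frac{1}{\eta_0} \inf_{x \in X} \psi(x) = 0,$$
$$\xi(\eta_t, L^{(t)}) = \frac{1}{\eta_t} \inf_{x \in X} \left( \langle \eta_t L^{(t)}, x \rangle + \psi(x) \right) \leq \langle L^{(t)}, x \rangle + \frac{1}{\eta_t} \psi(x).$$

Therefore

$$\sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle \leq \langle L^{(t)}, x \rangle + \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2\ell_\psi} \sum_{\tau=1}^{t} \eta_\tau,$$

which proves the claim. □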


Note that the regularizer $\psi$ can be unbounded on $X$. This is true for example for the entropy regularizer
$\psi(x) = \int_S x(s) \ln(x(s)) \lambda(ds)$, which we will use in Section 4. Thus to obtain a useful (sublinear) bound on
the regret, it may not suffice to take a supremum in the bound of Lemma 1. This motivates the following
Theorem. In what follows, we will assume that the loss functions are Lipschitz, uniformly in time. Let
$s_t^\star \in \arg\min_{s \in S} L^{(t)}(s)$ (since the loss functions are continuous and $S$ is compact, the minimum is attained).

Intuitively, if the losses are Lipschitz, then $\inf_{x \in X} \langle L^{(t)}, x \rangle$ is well approximated by the cumulative loss of
distributions which concentrate their mass around $s_t^\star$.

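Theorem 1 (Dual Averaging Regret for Lipschitz Losses). Suppose that $\ell^{(t)}$ is $L$-Lipschitz, and $\|\ell^{(t)}\|_* \leq M$,
uniformly in $t$. Then the dual averaging method with learning rates $(\eta_t)$ guarantees the following bound on
the regret: for any positive sequence $(d_t)$,

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2\ell_\psi} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L d_t + \frac{1}{t \eta_t} \inf_{x \in B_t} \psi(x), \qquad (3)$$

where $B_t \subset X$ denotes the set of Lebesgue-continuous densities supported on $B(s_t^\star, d_t)$. (The learning rates are indexed here as in Lemma 1.)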
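Proof. First, we observe that

$$R^{(t)} = \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle - \inf_{x \in X} \langle L^{(t)}, x \rangle \leq \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} \rangle - L^{(t)}(s_t^\star).$$

Since the losses are $L$-Lipschitz, we have for all $s \in B(s_t^\star, d_t)$ and all $\tau$,

$$\ell^{(\tau)}(s) \leq \ell^{(\tau)}(s_t^\star) + L d_t.$$

Thus, for all $x \in B_t$,

$$\langle L^{(t)}, x \rangle = \sum_{\tau=1}^{t} \int_S \ell^{(\tau)}(s) x(s) \lambda(ds) \leq \sum_{\tau=1}^{t} \int_S (\ell^{(\tau)}(s_t^\star) + L d_t) x(s) \lambda(ds) = L^{(t)}(s_t^\star) + t L d_t,$$

and therefore

$$R^{(t)} \leq \sum_{\tau=1}^{t} \langle \ell^{(\tau)}, x^{(\tau)} - x \rangle + t L d_t \leq \frac{1}{\eta_t} \psi(x) + \frac{M^2}{2\ell_\psi} \sum_{\tau=1}^{t} \eta_\tau + t L d_t,$$

where the last inequality uses Lemma 1. We conclude by dividing by $t$ and taking the infimum over $x \in B_t$. □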

We now have a general regret bound for the dual averaging method applied to Problem 1. In the next
section, we further study the dual averaging algorithm with a particular family of regularizers, and we study
their properties.


3 Dual Averaging with $\omega$-potentials on Uniformly Fat Sets

We now study the dual averaging method when the regularizer is the $f$-divergence, or Csiszár divergence,
of a particular class of potential functions. This definition is a generalization of [5] to our infinite dimensional
Hilbert setting.

3.1 Csiszár divergence induced by $\omega$-potentials

Definition 2. Let $\omega \leq 0$ and $a \in (-\infty, +\infty]$. An increasing diffeomorphism $\phi : (-\infty, a) \to (\omega, \infty)$ is
an $\omega$-potential if

$$\lim_{u \to -\infty} \phi(u) = \omega, \qquad \lim_{u \to a} \phi(u) = +\infty, \qquad \int_0^1 \phi^{-1}(u) du < \infty.$$

We associate to $\phi$ the function $f_\phi$, defined on $(0, \infty)$,

$$f_\phi(x) = \int_1^x \phi^{-1}(u) du,$$

which is, by definition, convex (since $\phi^{-1}$ is increasing), and satisfies $f_\phi(1) = 0$.^1 We also associate the
$f_\phi$-divergence, defined on $X$ by

$$\psi_{f_\phi}(x) = \int_S f_\phi(x(s)) \lambda(ds).$$

By convexity of $f_\phi$, we have for all $x \in X$, $\psi_{f_\phi}(x) \geq f_\phi\left( \int_S x(s) \lambda(ds) \right) = f_\phi(1) = 0$.

Example 1 (Euclidean projection). Perhaps the simplest instance of an $\omega$-potential is the identity $\phi(u) = u$,
for which $\omega = -\infty$ and $a = +\infty$. In this case, $f_\phi(x) = \frac{x^2 - 1}{2}$, and the resulting Csiszár divergence is
$\psi_{f_\phi}(x) = \frac{1}{2} \int_S x(s)^2 \lambda(ds) - \frac{1}{2} = \frac{1}{2} \|x\|_2^2 - \frac{1}{2}$. The dual averaging method then projects, at each iteration, in the
$L^2$ norm, the dual vector $-\eta_{t+1} L^{(t)}$ onto the feasible set.

^1 Note that in the original definition of Audibert et al., the function $f_\phi$ is taken to be the integral from $\omega$ to $\omega + x$. Our
definition corresponds to a translation of $f_\phi$ in order to have $f_\phi(1) = 0$, which guarantees that $\psi_{f_\phi}(x) \geq 0$ on $X$.









Figure 1: Illustration of an $\omega$-potential.


Example 2. More generally, if $p > 1$, then $\phi(u) = \mathrm{sign}(u) |u|^{\frac{1}{p-1}}$ is an $\omega$-potential with $\omega = -\infty$ and
$a = +\infty$. In this case, $\phi^{-1}(u) = \mathrm{sign}(u) |u|^{p-1}$ and $f_\phi(x) = \frac{x^p - 1}{p}$, and the corresponding Csiszár divergence
is $\psi_{f_\phi}(x) = \frac{\|x\|_p^p - 1}{p}$.

Example 3 (Entropy projection). If we take $\phi(u) = e^{u-1}$, then the corresponding density function is
$f_\phi(x) = \int_1^x \phi^{-1}(u) du = \int_1^x (1 + \ln u) du = x \ln x$, and the associated $f_\phi$-divergence is the negative entropy

$$\psi_{f_\phi}(x) = \int_S x(s) \ln x(s) \lambda(ds).$$

See Section 4 for a generalization of the entropy divergence.
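The three examples above can be checked numerically on a discretized density. The sketch below is illustrative only: it assumes that $S$ is split into cells of equal $\lambda$-measure, so that the integral defining $\psi_{f_\phi}$ becomes an average, and it adopts the convention $0 \ln 0 = 0$. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def f_euclidean(x: float) -> float:
    """f_phi for phi(u) = u (Example 1): (x^2 - 1) / 2."""
    return 0.5 * (x * x - 1.0)


def f_power(x: float, p: float) -> float:
    """f_phi for phi(u) = sign(u)|u|^(1/(p-1)) (Example 2): (x^p - 1) / p."""
    return (x ** p - 1.0) / p


def f_entropy(x: float) -> float:
    """f_phi for phi(u) = exp(u - 1) (Example 3): x ln x, with 0 ln 0 = 0."""
    return x * math.log(x) if x > 0.0 else 0.0


def csiszar_divergence(density: Sequence[float], f: Callable[[float], float]) -> float:
    """Approximate psi(x) = int_S f(x(s)) lambda(ds) on equal-measure cells.

    density[i] is the value of the density on cell i; a valid element of X
    has nonnegative entries whose average is 1.
    """
    return sum(f(v) for v in density) / len(density)
```

For the uniform density (all entries equal to 1), each divergence evaluates to 0, in agreement with $f_\phi(1) = 0$ and the lower bound $\psi_{f_\phi} \geq 0$ on $X$.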


3.2 Strong convexity 

In order for the bound of Theorem 1 to hold, we need the regularizer to be strongly convex. In this section,
we give sufficient conditions on the potential for strong convexity of $\psi_{f_\phi}$ with respect to $p$ norms, defined as
follows for $p \geq 1$:

$$\|x\|_p = \left( \int_S |x(s)|^p \lambda(ds) \right)^{1/p}.$$

Theorem 2. Let $\phi$ be an $\omega$-potential, and suppose that there exist $\alpha > 0$, $z_0 \geq 0$ and $r \in (0, 1]$ such that
$(\phi^{-1})'(z) \geq \frac{1}{\alpha (z + z_0)^r}$ for all $z > 0$. Then $\psi_{f_\phi}$ is $\ell$-strongly convex w.r.t. the $p$ norm, with $\ell = \frac{1}{\alpha (1 + z_0)^r}$ and
$p = \frac{2}{1+r}$. That is, for all $x, y \in X$,

$$D_{\psi_{f_\phi}}(x, y) \geq \frac{1}{2\alpha (1 + z_0)^r} \|x - y\|_{\frac{2}{1+r}}^2.$$

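Proof. By definition, the $f$-divergence associated to the potential $\phi$ is differentiable at any $x \in X$ with $x > 0$
a.e., and has gradient

$$\nabla\psi_{f_\phi}(x)(s) = \phi^{-1}(x(s)).$$

Thus,

$$D_{\psi_{f_\phi}}(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle$$
$$= \int_S \left[ f_\phi(x(s)) - f_\phi(y(s)) - \phi^{-1}(y(s)) (x(s) - y(s)) \right] \lambda(ds)$$
$$= \int_S \left[ \int_{y(s)}^{x(s)} \phi^{-1}(u) du - \phi^{-1}(y(s)) (x(s) - y(s)) \right] \lambda(ds),$$

and by a Taylor expansion of $f_\phi$, there exists $z(s) \in [x(s), y(s)]$ such that

$$\int_{y(s)}^{x(s)} \phi^{-1}(u) du - \phi^{-1}(y(s)) (x(s) - y(s)) = (\phi^{-1})'(z(s)) \frac{(x(s) - y(s))^2}{2} \geq \frac{1}{2} \frac{(x(s) - y(s))^2}{\alpha (z(s) + z_0)^r}$$

by assumption on $\phi$.

Now by Hölder's inequality, if $p, q \geq 1$ are conjugate, i.e. $\frac{1}{p} + \frac{1}{q} = 1$, then for any $(a, b) \in L^p(S) \times L^q(S)$,
$\langle a, b \rangle \leq \|a\|_p \|b\|_q$, and it follows that whenever $\|b\|_q > 0$, $\|a\|_p^p \geq \left( \frac{\langle a, b \rangle}{\|b\|_q} \right)^p$. Applying this inequality with
$a = \left( \frac{(x - y)^2}{\alpha (z + z_0)^r} \right)^{1/p}$ and $b = (\alpha (z + z_0)^r)^{1/p}$, we have

$$\int_S \frac{(x(s) - y(s))^2}{\alpha (z(s) + z_0)^r} \lambda(ds) \geq \frac{\left( \int_S |x(s) - y(s)|^{2/p} \lambda(ds) \right)^p}{\left( \int_S (\alpha (z(s) + z_0)^r)^{q/p} \lambda(ds) \right)^{p/q}} = \frac{\|x - y\|_{2/p}^2}{\alpha \|(z + z_0)^r\|_{q/p}}.$$

In particular, if we take $p = 1 + r$, $q = 1 + \frac{1}{r}$, then $\frac{q}{p} = \frac{1}{r}$, and $\|(z + z_0)^r\|_{1/r} = \|z + z_0\|_1^r = (1 + z_0)^r$ since
$z \in [x, y] \subset X$ and $z_0 \geq 0$. Therefore

$$D_{\psi_{f_\phi}}(x, y) \geq \frac{1}{2} \int_S \frac{(x(s) - y(s))^2}{\alpha (z(s) + z_0)^r} \lambda(ds) \geq \frac{1}{2\alpha (1 + z_0)^r} \|x - y\|_{\frac{2}{1+r}}^2,$$

which concludes the proof.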


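□

As a consequence of Theorem 2, we can show that the Bregman divergences of Example 2, $\psi(x) = \frac{1}{p} (\|x\|_p^p - 1)$,
$p \in (1, 2]$, are strongly convex w.r.t. $\|\cdot\|_{\frac{2}{3-p}}$.

Corollary 1. Let $p \in (1, 2)$, and consider the $\omega$-potential $\phi(u) = \mathrm{sign}(u) |u|^{\frac{1}{p-1}}$, and its corresponding
Csiszár divergence $\psi(x) = \frac{1}{p} (\|x\|_p^p - 1)$. Then $\psi$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_{\frac{2}{3-p}}$.

Proof. We have $\phi^{-1}(u) = \mathrm{sign}(u) |u|^{p-1}$, thus for $z > 0$,

$$(\phi^{-1})'(z) = (p - 1) z^{p-2} = \frac{p - 1}{z^{2-p}},$$

where $2 - p \in (0, 1)$ by assumption on $p$. Thus we can apply Theorem 2 with $r = 2 - p$, $\alpha = \frac{1}{p-1}$ and $z_0 = 0$,
which proves the claim. □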


Note that the corollary also holds for $p = 2$, since one can explicitly compute the Bregman divergence:
we have $\psi(x) = \frac{1}{2} \|x\|_2^2 - \frac{1}{2}$ and

$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle = \frac{1}{2} \left( \|x\|_2^2 - \|y\|_2^2 - 2 \langle y, x - y \rangle \right) = \frac{1}{2} \|x - y\|_2^2,$$

which is 1-strongly convex w.r.t. $\|\cdot\|_2$. Finally, we observe that a similar result is proved, in the finite-dimensional
case, in [3], Lemma 8.1: for $p \in (1, 2]$, $\frac{1}{2} \|\cdot\|_p^2$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$. Note that
the result concerns $\|\cdot\|_p^2$ while Corollary 1 concerns $\|\cdot\|_p^p$. The squared $p$-norm is not a Csiszár divergence
induced by an $\omega$-potential in general (except for $p = 2$). Using $\|\cdot\|_p^p$ as a regularizer instead of $\|\cdot\|_p^2$ allows
us to benefit from the properties of $\omega$-potentials; in particular, the solution of the dual averaging iteration
can be computed efficiently, as discussed in Section 3.3.

Next, we give another sufficient condition for strong convexity w.r.t. $\|\cdot\|_1$.














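Proposition 2. Let $\phi$ be an $\omega$-potential, and assume that $\phi$ is a $C^2$-diffeomorphism. Consider the potential
density $f_\phi$ as in Definition 2. Let $w = 1 + \frac{1}{3} \frac{f_\phi'''(1)}{f_\phi''(1)}$. If $f_\phi$ satisfies one of the following conditions for all $u > 0$:

$$\left( f_\phi(u) - f_\phi'(1)(u - 1) \right) \left( 1 + (1 - w)(u - 1) \right) \geq \frac{f_\phi''(1)}{2} (u - 1)^2,$$

$$\mathrm{sgn}(u - 1) \left( \frac{f_\phi'''(u)}{f_\phi''(u)} \left[ 1 + (1 - w)(u - 1) \right] + 3(1 - w) \right) \geq 0,$$

then the $f_\phi$-divergence is strongly convex with respect to the total variation norm. More precisely, for all
$x, y \in X$,

$$D_{f_\phi}(x, y) \geq \frac{f_\phi''(1)}{2} D_{TV}(x, y)^2.$$

Proof. By Definition 2, if $\phi$ is a $C^2$-diffeomorphism, then $f_\phi$ is three times differentiable on $(0, \infty)$, and
for all $x > 0$, $f_\phi''(x) = (\phi^{-1})'(x) = \frac{1}{\phi'(\phi^{-1}(x))}$, which is, by assumption on $\phi$, strictly positive. Thus by
the generalized Pinsker inequality in [13] (Theorem 3 and Corollary 4) the associated $f$-divergence satisfies
$D_{f_\phi}(x, y) \geq \frac{f_\phi''(1)}{2} D_{TV}(x, y)^2$, where $D_{TV}$ is the total variation norm, $D_{TV}(x, y) = \frac{1}{2} \|x - y\|_1$. This concludes
the proof. □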


3.3 Solution of the Bregman projection with $\omega$-potentials

We now characterize the solution of the dual averaging update, given by the Bregman projection in equation (1).

Proposition 3. Let $\phi$ be an $\omega$-potential. Let $L^{(t)} \in E^*$, and consider the dual averaging iteration (1) in
Algorithm 2, with the regularizer $\psi$ taken to be the $f_\phi$-divergence associated to $\phi$. Then the solution
is given by

$$x^{(t+1)}(s) = \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^*) \right)_+,$$

where $y_+$ denotes the positive part of $y$, and $\nu^*$ satisfies $\int_S \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^*) \right)_+ \lambda(ds) = 1$.

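Proof. Let $K$ be the cone $K = \{x \in L^2(S) : x \geq 0 \text{ a.e.}\}$, and let

$$f(x) = \langle L^{(t)}, x \rangle + \frac{1}{\eta_{t+1}} \psi(x) + i_K(x),$$

where $i_K$ is the indicator function of the cone $K$, i.e. $i_K(x) = 0$ if $x \in K$ and $+\infty$ otherwise. The dual
averaging iteration is equivalent to the following problem:

$$\text{minimize}_{x \in L^2(S)} \ f(x) \quad \text{subject to} \quad \langle 1, x \rangle = 1,$$

where $1 : S \to \mathbb{R}$ is identically equal to 1. Using the fact that the subdifferential of the indicator $i_K$ is the
normal cone (see for example Chapter 16 in [3]), given for all $x \in K$ by

$$\partial i_K(x) = N_K(x) = \left\{ g \in E^* : \sup_{y \in K} \langle g, y - x \rangle \leq 0 \right\},$$

the subdifferential of the objective function is

$$\partial f(x) = L^{(t)} + \frac{1}{\eta_{t+1}} \phi^{-1}(x) + N_K(x).$$

First, we show that, for all $x \in K$ and all $g \in N_K(x)$, $g \leq 0$ and $g x = 0$ almost everywhere. Indeed, fixing $x \in K$,
we have $\langle g, y - x \rangle \leq 0$ for all $y \in K$. In particular, $x + 1_{g > 0} \in K$, thus

$$0 \geq \langle g, 1_{g > 0} \rangle = \int_S g_+(s) \lambda(ds),$$

which proves that $g_+ = 0$ a.e. Furthermore, taking $y = \frac{1}{2} x \in K$, we have

$$0 \geq \langle g, y - x \rangle = -\frac{1}{2} \langle g, x \rangle = \frac{1}{2} \int_S |g(s)| x(s) \lambda(ds),$$

which implies that $|g| x = 0$ a.e., which proves the claim.

Now, consider the Lagrangian $\mathcal{L} : E \times \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$,

$$\mathcal{L}(x, \nu) = \langle L^{(t)}, x \rangle + \frac{1}{\eta_{t+1}} \psi(x) + i_K(x) + \nu (\langle 1, x \rangle - 1).$$

Then $(x^*, \nu^*)$ is an optimal pair only if

$$0 \in L^{(t)} + \frac{1}{\eta_{t+1}} \phi^{-1}(x^*) + N_K(x^*) + \nu^* 1, \qquad \langle 1, x^* \rangle = 1,$$

see for example Section 19.3 in [3]. We can rewrite the stationarity condition as follows:

$$\exists g^* \in N_K(x^*) \text{ such that } L^{(t)} + \frac{1}{\eta_{t+1}} \phi^{-1}(x^*) + \nu^* 1 + g^* = 0.$$

Therefore,

$$x^*(s) = \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^* + g^*(s)) \right) \text{ a.e.}, \qquad g^* \in N_K(x^*), \qquad \langle 1, x^* \rangle = 1.$$

In particular, let $\mathrm{support}(x^*) = \{s : x^*(s) > 0\}$ (a measurable set). By the complementary slackness
condition, $g^* x^* = 0$ a.e., therefore $g^* = 0$ a.e. on $\mathrm{support}(x^*)$. And for a.e. $s \notin \mathrm{support}(x^*)$, we have

$$\phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^*) \right) \leq \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^* + g^*(s)) \right) = 0$$

since $\phi$ is increasing and $g^* \leq 0$ a.e. Therefore the optimality conditions become

$$x^*(s) = \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^*) \right)_+, \qquad \int_S \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu^*) \right)_+ \lambda(ds) = 1,$$

which proves the claim. □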

Proposition 3 shows that the solution of the Bregman projection (1) is entirely determined by the dual
variable $\nu^*$, therefore computing the solution reduces to computing the optimal $\nu^*$. Furthermore, we observe
that the function $\nu \mapsto \int_S \phi\left( -\eta_{t+1} (L^{(t)}(s) + \nu) \right)_+ \lambda(ds)$ is monotone in $\nu$ (nonincreasing, since $\phi$ is increasing
and $\eta_{t+1} > 0$), therefore one
can compute $\nu^*$ (to arbitrary precision) using a simple bisection method. Note that in general, the solution
$x^{(t+1)}$ may not be supported everywhere on $S$, unless $\omega = 0$, in which case $\phi$ is, by definition, strictly positive.
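The bisection described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: $S$ is discretized into cells of equal $\lambda$-measure (so integrals become averages), the potential is assumed to be defined on all of $\mathbb{R}$ (i.e. $a = +\infty$) and unbounded above, and the names `potential_update`, `cum_loss` and `iterations` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def potential_update(
    cum_loss: Sequence[float],
    eta: float,
    phi: Callable[[float], float],
    iterations: int = 200,
) -> List[float]:
    """Solve x(s) = phi(-eta * (L(s) + nu))_+ with average(x) = 1 by bisection on nu.

    cum_loss[i] is the cumulative loss L^(t) on cell i of an equal-measure
    discretization of S; phi is an omega-potential defined on the real line.
    """
    def density(nu: float) -> List[float]:
        return [max(phi(-eta * (loss + nu)), 0.0) for loss in cum_loss]

    def mass(nu: float) -> float:
        return sum(density(nu)) / len(cum_loss)

    # The mass is nonincreasing in nu: expand until [lo, hi] brackets mass = 1.
    lo, hi = -1.0, 1.0
    while mass(lo) < 1.0:
        lo *= 2.0
    while mass(hi) > 1.0:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid      # too much mass: nu must increase
        else:
            hi = mid
    return density(0.5 * (lo + hi))
```

With the identity potential of Example 1 and a large enough learning rate, the returned density vanishes on cells with high cumulative loss, which illustrates the remark that the solution need not be supported everywhere on $S$. With `phi = lambda u: math.exp(u - 1.0)` the density is strictly positive.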





3.4 Regret analysis 

Next, we show that under the appropriate assumptions on the feasible set $S$, and the asymptotic behavior of
the $\omega$-potential, it is possible to achieve sublinear regret with dual averaging. First, we focus our attention
on sets which are uniformly fat (using the definition of [19]), a generalization of convexity.

Definition 3. Consider a subset $S \subset \mathbb{R}^n$. $S$ is said to be $v$-uniformly fat if there exists $v > 0$ such that for
all $s \in S$, there exists a convex $K_s \subset S$ such that $s \in K_s$ and $\lambda(K_s) \geq v$.

In particular, if $S$ is convex, it is 1-uniformly fat.



Figure 2: Illustration of uniform fatness (left) and the homothetic transformation used in the proof of
Proposition 4 (right).

Intuitively, the uniform fatness condition guarantees that there is sufficient volume around any point of
$S$, so that the solution of the Bregman projection assigns enough probability mass around the optimum.
In particular, uniform fatness excludes isolated points.

The next Proposition gives a regret bound on uniformly fat sets. For a subset $C \subset \mathbb{R}^n$, we denote the
diameter of $C$ by $D(C) = \sup_{s, s' \in C} \|s - s'\|$.

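Proposition 4. Suppose that $S$ is $v$-uniformly fat. Let $\phi$ be an $\omega$-potential, and take the regularizer to be
the Csiszár divergence $\psi_{f_\phi}$. Suppose that $\psi_{f_\phi}$ is $\ell_\psi$-strongly convex w.r.t. $\|\cdot\|$, and that $\|\ell^{(t)}\|_*$ is bounded by
$M$. Let $(d_t)$ be a sequence of positive numbers. Then for all $t$,

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2\ell_\psi} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L D(S) d_t + \frac{1}{t \eta_t} v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right), \qquad (4)$$

with the learning rates indexed as in Lemma 1.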

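Proof. Let $s_t^\star \in \arg\min_{s \in S} L^{(t)}(s)$. Since $S$ is $v$-uniformly fat, there exists a convex subset $K_t \subset S$, containing
$s_t^\star$, such that $\lambda(K_t) \geq v$. Following the argument in [TS] and [17], consider the homothetic transformation
of $K_t$, of center $s_t^\star$ and ratio $d_t$, given by

$$S_t = \{ s_t^\star + d_t (s - s_t^\star), \ s \in K_t \}.$$

For $d_t \leq 1$, $S_t \subset K_t \subset S$ by convexity of $K_t$. The diameter and mass of $S_t$ satisfy

$$D(S_t) = d_t D(K_t) \leq d_t D(S), \qquad \lambda(S_t) = d_t^n \lambda(K_t) \geq v d_t^n.$$

Let $u_{S_t}$ be the uniform distribution over $S_t$. Since $u_{S_t}$ has support in $B(s_t^\star, d_t D(S))$, we have, by Theorem 1,

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2\ell_\psi} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L D(S) d_t + \frac{1}{t \eta_t} \psi(u_{S_t}),$$

where, by definition of the divergence,

$$\psi(u_{S_t}) = \int_{S_t} f_\phi(1/\lambda(S_t)) \lambda(ds) = f_\phi(1/\lambda(S_t)) \lambda(S_t).$$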




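To conclude, we observe that $x \mapsto f_\phi(x)/x$ is increasing on $[1, \infty)$. Indeed, by definition of an $\omega$-potential, $f_\phi$ has
derivative $\phi^{-1}$, and since $\phi^{-1}$ is increasing, for all $x \geq 1$, $x \phi^{-1}(x) - \int_1^x \phi^{-1}(u) du \geq 0$. Therefore

$$\psi(u_{S_t}) = f_\phi(1/\lambda(S_t)) \lambda(S_t) \leq v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right),$$

which concludes the proof. □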

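Next, we show that if $f_\phi$ has a bounded asymptotic growth rate, the regret grows sublinearly.

Theorem 3. Suppose that $S$ is $v$-uniformly fat. Let $\phi$ be an $\omega$-potential such that $\psi_{f_\phi}$ is $\ell_\psi$-strongly convex
with respect to $\|\cdot\|$, and suppose that $\ell^{(t)}$ is $L$-Lipschitz and $\|\ell^{(t)}\|_* \leq M$ uniformly in $t$. Suppose that there
exist $\epsilon > 0$ and $C > 0$ such that

$$f_\phi(x) \leq C x^{1+\epsilon}$$

for all $x \geq 1$. Then the dual averaging method with $f_\phi$-divergence and learning rates $\eta_t = \theta t^{-\alpha}$, $\alpha \in (0, 1)$, satisfies the
following bound on the per-round regret:

$$\frac{R^{(t)}}{t} \leq \frac{M^2 \theta}{2\ell_\psi (1 - \alpha)} t^{-\alpha} + \left( L D(S) + \frac{C}{\theta v^\epsilon} \right) t^{-\frac{1-\alpha}{1+n\epsilon}} = O\left( t^{-\alpha} + t^{-\frac{1-\alpha}{1+n\epsilon}} \right).$$

The rate is optimal for $\alpha = \frac{1}{2+n\epsilon}$, which yields

$$\frac{R^{(t)}}{t} = O\left( t^{-\frac{1}{2+n\epsilon}} \right).$$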

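Proof. Let $(d_t)$ be a decreasing sequence which converges to 0. By Proposition 4, the regret of the dual
averaging method is bounded by

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2\ell_\psi} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L D(S) d_t + \frac{1}{t \eta_t} v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right),$$

where we can bound $\sum_{\tau=1}^{t} \eta_\tau = \theta \sum_{\tau=1}^{t} \tau^{-\alpha} \leq \theta \int_0^t u^{-\alpha} du = \frac{\theta}{1-\alpha} t^{1-\alpha}$. By assumption on $f_\phi$, we
have, whenever $v d_t^n \leq 1$,

$$v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right) \leq C v d_t^n \left( \frac{1}{v d_t^n} \right)^{1+\epsilon} = C v^{-\epsilon} d_t^{-n\epsilon}.$$

Combining these bounds, we have

$$\frac{R^{(t)}}{t} \leq \frac{M^2 \theta}{2\ell_\psi (1 - \alpha)} t^{-\alpha} + L D(S) d_t + \frac{C v^{-\epsilon}}{\theta} t^{\alpha - 1} d_t^{-n\epsilon}.$$

Taking $d_t = t^{-\frac{1-\alpha}{1+n\epsilon}}$, the bound becomes $\frac{M^2 \theta}{2\ell_\psi (1-\alpha)} t^{-\alpha} + \left( L D(S) + \frac{C v^{-\epsilon}}{\theta} \right) t^{-\frac{1-\alpha}{1+n\epsilon}}$, which proves the
claim. □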
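The trade-off between the two terms of the bound in Theorem 3 can be explored numerically. The sketch below evaluates the two exponents and the right-hand side of the bound, reading it as $\frac{M^2\theta}{2\ell_\psi(1-\alpha)} t^{-\alpha} + (L D(S) + \frac{C}{\theta v^\epsilon}) t^{-\frac{1-\alpha}{1+n\epsilon}}$ with $\alpha = \frac{1}{2+n\epsilon}$. The function and parameter names are hypothetical, and the constants passed in are placeholders for illustration, not values taken from this article.

```python
from typing import Tuple


def theorem3_schedule(n: int, eps: float) -> Tuple[float, float]:
    """Return (alpha, rho): the learning-rate exponent alpha = 1 / (2 + n * eps)
    and the radius exponent rho = (1 - alpha) / (1 + n * eps), with d_t = t^(-rho)."""
    alpha = 1.0 / (2.0 + n * eps)
    rho = (1.0 - alpha) / (1.0 + n * eps)
    return alpha, rho


def theorem3_bound(
    t: int, n: int, eps: float, theta: float, M: float,
    ell_psi: float, L: float, diameter: float, C: float, v: float,
) -> float:
    """Evaluate the per-round regret bound for eta_t = theta * t^(-alpha)."""
    alpha, rho = theorem3_schedule(n, eps)
    first = M * M * theta / (2.0 * ell_psi * (1.0 - alpha)) * t ** (-alpha)
    second = (L * diameter + C / (theta * v ** eps)) * t ** (-rho)
    return first + second
```

With $\alpha = \frac{1}{2+n\epsilon}$ the two exponents coincide, which is why this choice balances the two terms; the dependence on the dimension $n$ enters only through the product $n\epsilon$.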

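One can formulate similar regret bounds under different assumptions on the asymptotic behavior of $f_\phi$.
For example, one can show the following extension, proved in the Appendix.

Theorem 4. Under the assumptions of Theorem 3, suppose that there exist $\epsilon \geq 0$, $\mu > 0$ such that

$$f_\phi(x) = O\left( x^{1+\epsilon} (\ln x)^\mu \right)$$

and the learning rates are taken to be $\eta_t = \Theta\left( t^{-\alpha} (\ln t)^{\mu/2} \right)$, $\alpha = \frac{1}{2+n\epsilon}$. Then

$$\frac{R^{(t)}}{t} = O\left( (\ln t)^{\mu/2} t^{-\frac{1}{2+n\epsilon}} \right).$$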









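To conclude this Section, we observe that while the regret is defined with respect to elements of $X$
(Definition 1), it is equivalent, for uniformly fat sets, to the regret with respect to elements of $S$, in the
following sense:

$$\inf_{x \in X} \left\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \right\rangle = \min_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s). \qquad (5)$$

Proof. Let $s_t^\star \in \arg\min_{s \in S} \sum_{\tau=1}^{t} \ell^{(\tau)}(s)$. Then it suffices to show that for all $\epsilon > 0$, there exists $x \in X$ such that

$$\left\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \right\rangle \leq \sum_{\tau=1}^{t} \ell^{(\tau)}(s_t^\star) + \epsilon.$$

Fix $\epsilon > 0$. Since $S$ is $v$-uniformly fat, there exists a convex set $K_t \subset S$ containing $s_t^\star$, with $\lambda(K_t) > 0$. Let
$S_t$ be the homothetic transform of $K_t$, of center $s_t^\star$ and ratio $d_t$, as in the proof of Proposition 4. Then we
have

$$D(S_t) = d_t D(K_t) \leq d_t D(S), \qquad \lambda(S_t) = d_t^n \lambda(K_t) > 0.$$

Now consider $x = \frac{1}{\lambda(S_t)} 1_{S_t}$. We have $x \in X$, and since the $\ell^{(\tau)}$ are uniformly $L$-Lipschitz,

$$\left\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \right\rangle \leq \sum_{\tau=1}^{t} \int_{S_t} \left( \ell^{(\tau)}(s_t^\star) + L d_t D(S) \right) \frac{1}{\lambda(S_t)} \lambda(ds) = t L d_t D(S) + \sum_{\tau=1}^{t} \ell^{(\tau)}(s_t^\star).$$

In particular, if we choose $d_t = \frac{\epsilon}{t L D(S)}$, then $\langle \sum_{\tau=1}^{t} \ell^{(\tau)}, x \rangle \leq \sum_{\tau=1}^{t} \ell^{(\tau)}(s_t^\star) + \epsilon$, which proves the
claim. □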


4 Entropy dual averaging 

Let $\omega \leq 0$, and consider the $\omega$-potential

$$\phi : (-\infty, \infty) \to (\omega, \infty), \qquad u \mapsto \phi(u) = e^{u-1} + \omega.$$

Its inverse is $\phi^{-1}(u) = 1 + \ln(u - \omega)$. In particular, we have $\int_0^1 \phi^{-1}(u) du = (u - \omega) \ln(u - \omega) \big|_0^1 < \infty$. The
resulting density function is simply

$$f_\phi(x) = \int_1^x \phi^{-1}(u) du = (x - \omega) \ln(x - \omega) - (1 - \omega) \ln(1 - \omega),$$

and the associated divergence is the following generalized negative entropy:

$$\psi_{f_\phi}(x) = \int_S (x(s) - \omega) \ln(x(s) - \omega) \lambda(ds) - (1 - \omega) \ln(1 - \omega) = H(x - \omega) - H(1 - \omega),$$

where $H(x) = \int_S x(s) \ln x(s) \lambda(ds)$. By applying Proposition 3, we can derive the explicit solution of the
dual averaging update (1).



Figure 3: Illustration of the entropy $\omega$-potential $\phi(u) = e^{u-1} + \omega$.

Corollary 2. Consider the dual averaging method on the set of distributions $X$, regularized by the negative
entropy. The solution of update (1) is given by

$$x^{(t+1)}(s) = \left( \frac{e^{-\eta_{t+1} L^{(t)}(s)}}{Z^{(t)}} + \omega \right)_+,$$

where $Z^{(t)}$ is the appropriate normalization constant. In particular, when $\omega = 0$, $x^{(t+1)}(s) = \frac{e^{-\eta_{t+1} L^{(t)}(s)}}{Z^{(t)}}$
and $Z^{(t)} = \int_S e^{-\eta_{t+1} L^{(t)}(s)} \lambda(ds)$, and we recover the Hedge algorithm.

By definition of the entropy potential function, we have $(\phi^{-1})'(z) = \frac{1}{z - \omega}$, therefore the assumptions of
Theorem 2 hold with $\alpha = 1$, $z_0 = -\omega$, $r = 1$, thus the negative entropy is $\frac{1}{1-\omega}$-strongly convex with respect
to $\|\cdot\|_1$, and we can apply Proposition 4 and Theorem 4 to obtain a regret bound.
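For $\omega = 0$ the update of Corollary 2 has a closed form, which can be sketched on a discretized set as follows. This is an illustrative sketch assuming cells of equal $\lambda$-measure (so that $Z^{(t)}$ is an average of the exponential weights); the name `hedge_density` is hypothetical, and the subtraction of the minimum cumulative loss is a standard numerical safeguard that cancels in the normalization.

```python
import math
from typing import List, Sequence


def hedge_density(cum_loss: Sequence[float], eta: float) -> List[float]:
    """Density x(s) = exp(-eta * L(s)) / Z on an equal-measure discretization of S.

    cum_loss[i] is the cumulative loss on cell i; Z is the average of the
    exponential weights, so that the returned density averages to 1.
    """
    shift = min(cum_loss)  # shifting by a constant does not change the density
    weights = [math.exp(-eta * (loss - shift)) for loss in cum_loss]
    z = sum(weights) / len(weights)
    return [w / z for w in weights]
```

In contrast with the Euclidean potential, the resulting density is strictly positive on every cell, consistent with the remark following Proposition 3 for $\omega = 0$.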

Corollary 3. Suppose that $S$ is $v$-uniformly fat, and that the loss functions are $L$-Lipschitz, uniformly in
time, and that $\|\ell^{(t)}\|_* \leq M$. Then the dual averaging method on $X$, regularized with the negative entropy,
with learning rates $\eta_t = \Theta\left( \sqrt{\frac{\ln t}{t}} \right)$, has a sublinear regret such that $\frac{R^{(t)}}{t} = O\left( \sqrt{\frac{\ln t}{t}} \right)$.

Proof. By definition of the negative entropy, we have $f_\phi(x) = (x - \omega) \ln(x - \omega) - (1 - \omega) \ln(1 - \omega) = O(x \ln x)$,
and the result follows by Theorem 4. □

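In fact, we can obtain a more explicit upper bound on the regret. By Proposition 4, we have

$$\frac{R^{(t)}}{t} \leq \frac{(1 - \omega) M^2}{2} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L D(S) d_t + \frac{1}{t \eta_t} v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right),$$

and we bound $v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right)$. First, we have, for $x \geq 1$,

$$f_\phi(x) = (x - \omega) \ln(x - \omega) - (1 - \omega) \ln(1 - \omega) \leq (x - \omega) \ln(x - \omega) \leq c_\omega x \ln x + d_\omega,$$

where $c_\omega = 1 + \ln(1 - \omega)$ and $d_\omega = (1 - \omega) \ln(1 - \omega)$. To prove the last inequality, let $\delta(x) = (x - \omega) \ln(x - \omega) - c_\omega x \ln x - d_\omega$.
Then $\delta(1) = 0$ and for all $x \geq 1$,

$$\delta'(x) = \ln(x - \omega) + 1 - c_\omega (1 + \ln x) = \ln \frac{x - \omega}{x^{c_\omega} (1 - \omega)} \leq \ln \frac{x (1 - \omega)}{x^{c_\omega} (1 - \omega)} = \ln x^{1 - c_\omega} \leq 0.$$

Thus, whenever $d_t \leq 1$, we have $\frac{1}{v d_t^n} \geq 1$, and

$$v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right) \leq c_\omega \ln \frac{1}{v d_t^n} + d_\omega v d_t^n,$$

and taking $d_t = 1/t$, we have

$$\frac{R^{(t)}}{t} \leq \frac{(1 - \omega) M^2}{2} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + \frac{L D(S)}{t} + \frac{c_\omega (n \ln t - \ln v) + d_\omega v / t^n}{t \eta_t}.$$

In particular, when $\omega = 0$ (i.e. for the Hedge algorithm), $c_\omega = 1$ and $d_\omega = 0$, and the bound simplifies to

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + \frac{L D(S)}{t} + \frac{n \ln t - \ln v}{t \eta_t},$$

and we recover the bound on the Hedge regret obtained in [19].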
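The explicit Hedge bound can be evaluated for a given learning-rate schedule. The sketch below assumes the reading $\frac{M^2}{2} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + \frac{L D(S)}{t} + \frac{n \ln t - \ln v}{t \eta_t}$ of the bound for $\omega = 0$, with the learning rates indexed as $\eta_1, \dots, \eta_t$ (an indexing assumption); the function name and the callable `rates` are hypothetical.

```python
import math
from typing import Callable


def hedge_regret_bound(
    t: int,
    rates: Callable[[int], float],
    M: float,
    L: float,
    diameter: float,
    n: int,
    v: float,
) -> float:
    """Evaluate the per-round regret bound for Hedge (omega = 0) with d_t = 1 / t.

    rates(tau) returns the learning rate eta_tau; v is the uniform fatness
    constant and n the dimension of the ambient space.
    """
    eta_sum = sum(rates(tau) for tau in range(1, t + 1))
    regularization = (n * math.log(t) - math.log(v)) / (t * rates(t))
    return (M * M / 2.0) * eta_sum / t + L * diameter / t + regularization
```

For instance, with `rates = lambda tau: tau ** -0.5` the value decreases as `t` grows, reflecting sublinear regret; the third term carries the dependence on the dimension through $n \ln t$.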

5 Concluding Remarks 

We studied a sequential problem in which a decision maker chooses, at each iteration, a distribution over
a compact set $S$, then observes a loss from the class of $L$-Lipschitz continuous functions on $S$. Viewing
the problem as an online convex problem over $L^2(S)$, we applied the dual averaging method and derived
a general regret bound. Then we studied dual averaging with Csiszár divergences induced by $\omega$-potentials,
and showed that for this class, the Bregman projection (1) can be computed efficiently, assuming one can
efficiently evaluate integrals on $S$ (e.g. by using an MCMC method). We then provided sufficient conditions
on the asymptotic behavior of the potential to guarantee (i) strong convexity of the Csiszár divergence
(Theorem 2), and (ii) a sublinear regret (Theorem 3). These sufficient conditions provide guidance in the
design and analysis of dual averaging methods, which we illustrated with one particular family of $\omega$-potentials,
using entropy regularizers.

Another approach for learning on a continuum consists in applying a discrete learning algorithm on a
finite cover of $S$. Since the loss functions are $L$-Lipschitz, the additional regret incurred due to learning
on the cover is at most $L$ times the diameter of each element of the cover. Therefore, in order to have
asymptotically sublinear regret, one would need to refine the cover as the number of iterations grows. Our
method does not require explicitly computing a cover, since it samples directly from a distribution defined
on $S$. It has the potential of being more computationally tractable (since one does not need to compute
and refine a cover) but ultimately, its complexity is that of sampling from the sequence of distributions
$(x^{(t)})$. Thus, studying the computational complexity of the proposed dual averaging method requires making
additional assumptions on the family of loss functions and the feasible set $S$.

A related problem is bandit learning, in which the decision maker plays, at iteration $t$, an action $s^{(t)}$
drawn from the distribution $x^{(t)}$, then only observes the loss of that action, $\ell^{(t)}(s^{(t)})$, as opposed to the full
loss function $\ell^{(t)}$. This problem is studied for example in [7], for Lipschitz losses, and when the feasible set
$S$ is given by an explicit hierarchical formulation. We are currently investigating extensions of this work to
the bandit setting.
A Omitted proofs

A.1 Proof of Proposition 1

Proposition 1. If $\psi$ is $\ell_\psi$-strongly convex with respect to $\|\cdot\|$, then $\psi^*$ is $\frac{1}{\ell_\psi}$-smooth with respect to the
dual norm $\|\cdot\|_*$.


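Proof. Let $y_1, y_2 \in E^*$, and $x_i = \nabla\psi^*(y_i)$. Since $x_i$ is the minimizer of the convex function $x \mapsto \psi(x) - \langle y_i, x \rangle$,
we have, by first-order optimality,

$$\langle \nabla\psi(x_i) - y_i, x - x_i \rangle \geq 0 \quad \forall x \in X.$$

In particular, we have

$$\langle \nabla\psi(x_1) - y_1, x_2 - x_1 \rangle \geq 0,$$
$$\langle \nabla\psi(x_2) - y_2, x_1 - x_2 \rangle \geq 0,$$

and summing both inequalities,

$$\langle y_2 - y_1, x_2 - x_1 \rangle \geq \langle \nabla\psi(x_2) - \nabla\psi(x_1), x_2 - x_1 \rangle.$$

By definition of the Bregman divergence, we have

$$\langle y_2 - y_1, x_2 - x_1 \rangle \geq \langle \nabla\psi(x_2) - \nabla\psi(x_1), x_2 - x_1 \rangle = D_\psi(x_1, x_2) + D_\psi(x_2, x_1) \geq \ell_\psi \|x_2 - x_1\|^2,$$

and by definition of the dual norm, we have $\langle y_2 - y_1, x_2 - x_1 \rangle \leq \|y_2 - y_1\|_* \|x_2 - x_1\|$, since $\|y\|_* =
\sup_{\|x\| \leq 1} \langle x, y \rangle \geq \langle \frac{x}{\|x\|}, y \rangle$. Therefore,

$$\|y_2 - y_1\|_* \|x_2 - x_1\| \geq \ell_\psi \|x_2 - x_1\|^2,$$

and rearranging, we have $\|x_2 - x_1\| \leq \frac{1}{\ell_\psi} \|y_2 - y_1\|_*$, i.e.

$$\|\nabla\psi^*(y_2) - \nabla\psi^*(y_1)\| \leq \frac{1}{\ell_\psi} \|y_2 - y_1\|_*. \qquad (6)$$

Now by definition of the Bregman divergence, we have

$$D_{\psi^*}(x, y) = \psi^*(x) - \psi^*(y) - \langle \nabla\psi^*(y), x - y \rangle$$
$$= \int_0^1 \langle \nabla\psi^*(y + t(x - y)) - \nabla\psi^*(y), x - y \rangle dt$$
$$\leq \|x - y\|_* \int_0^1 \|\nabla\psi^*(y + t(x - y)) - \nabla\psi^*(y)\| dt$$
$$\leq \|x - y\|_* \int_0^1 \frac{1}{\ell_\psi} \|y + t(x - y) - y\|_* dt \quad \text{by (6)}$$
$$= \frac{1}{\ell_\psi} \|x - y\|_*^2 \int_0^1 t \, dt = \frac{1}{2\ell_\psi} \|x - y\|_*^2.$$

□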

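A.2 Proof of Theorem 4

Theorem 4. Under the assumptions of Theorem 3, suppose that there exist $\epsilon \geq 0$, $\mu > 0$ such that

$$f_\phi(x) = O\left( x^{1+\epsilon} (\ln x)^\mu \right)$$

and the learning rates are taken to be $\eta_t = \Theta\left( t^{-\alpha} (\ln t)^{\mu/2} \right)$, $\alpha = \frac{1}{2+n\epsilon}$. Then

$$\frac{R^{(t)}}{t} = O\left( (\ln t)^{\mu/2} t^{-\frac{1}{2+n\epsilon}} \right).$$

Proof. The proof is similar to that of Theorem 3. Let $(d_t)$ be a decreasing sequence which converges to 0.
By Proposition 4,

$$\frac{R^{(t)}}{t} \leq \frac{M^2}{2\ell_\psi} \frac{\sum_{\tau=1}^{t} \eta_\tau}{t} + L D(S) d_t + \frac{1}{t \eta_t} v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right),$$

where $\sum_{\tau=1}^{t} \eta_\tau = O\left( (\ln t)^{\mu/2} t^{1-\alpha} \right)$, and by assumption on $f_\phi$,

$$v d_t^n f_\phi\left( \frac{1}{v d_t^n} \right) \leq C v^{-\epsilon} d_t^{-n\epsilon} \left( \ln \frac{1}{v} + n \ln \frac{1}{d_t} \right)^\mu = O\left( d_t^{-n\epsilon} \left( \ln \frac{1}{d_t} \right)^\mu \right).$$

Combining these bounds, we have

$$\frac{R^{(t)}}{t} = O\left( (\ln t)^{\mu/2} t^{-\alpha} + d_t + t^{\alpha - 1} (\ln t)^{-\mu/2} d_t^{-n\epsilon} \left( \ln \frac{1}{d_t} \right)^\mu \right).$$

Taking $d_t = t^{-\frac{1-\alpha}{1+n\epsilon}}$, the bound becomes

$$\frac{R^{(t)}}{t} = O\left( (\ln t)^{\mu/2} t^{-\alpha} + t^{-\frac{1-\alpha}{1+n\epsilon}} (\ln t)^{\mu/2} \right),$$

which proves the claim.

□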



